Abstract

We study the on-line AdaTron learning of linearly non-separable rules by a
simple perceptron. Training examples are provided by a perceptron with a
non-monotonic transfer function which reduces to the usual monotonic relation
in a certain limit. We find that, although the on-line AdaTron learning is
a powerful algorithm for the learnable rule, it does not give the best possible
generalization error for unlearnable problems. Optimization of the learning
rate is shown to greatly improve the performance of the AdaTron algorithm,
leading to the best possible generalization error for a wide range of the
parameter which controls the shape of the transfer function.
PACS numbers: 87.10.+e
I. INTRODUCTION

The problem of learning is one of the most interesting aspects of feed-forward neural
networks. Recent activities in the theory of learning have gradually shifted toward the
issue of on-line learning. In the on-line learning scenario, the student is trained only by the
most recent example, which is never referred to again. In contrast, in the off-line (or batch)
learning scheme, the student is given a set of examples repeatedly and memorizes these
examples so as to minimize the global cost function. On-line learning therefore has
several advantages over the off-line method. For example, the student does not need
to memorize the whole set of examples, which saves a lot of memory space. In addition,
theoretical analysis of on-line learning is usually much less complicated than that of off-line
learning, which often makes use of the replica method.

In many studies of learning, authors assume that the teacher and student networks
have the same structure. The problem is called learnable in these cases. However, in the real
world we find innumerable unlearnable problems where the student is not able to reproduce
the output of the teacher perfectly, even in principle. It is therefore both important and interesting
to devote our efforts to the study of learning unlearnable rules.

If the teacher and student have the same structure, a natural strategy of learning is to
modify the weight vector of the student, $\mathbf{J}$, so that it approaches the teacher's weight
vector $\mathbf{J}^0$ as quickly as possible. However, if the teacher and student have different
structures, the student trained to satisfy $\mathbf{J} = \mathbf{J}^0$ sometimes cannot generalize the
unlearnable rule better than the student with $\mathbf{J} \neq \mathbf{J}^0$. Several years ago, Watkin and Rau
investigated the off-line learning of an unlearnable
rule where the teacher is a perceptron with a non-monotonic transfer function while the
student is a simple perceptron. They discussed the case where the number of examples is of
order unity and therefore did not derive the asymptotic form of the generalization error in
the limit of a large number of training examples. Furthermore, as they used the replica method
under the replica symmetric ansatz, the result may be unstable against replica symmetry
breaking.
For this type of non-monotonic transfer function, many interesting phenomena have
been reported. For example, the critical loading rate of the Hopfield-type model [7]
or the optimal storage capacity of the perceptron is known to increase dramatically with
non-monotonicity. It is also worth noting that perceptrons with the non-monotonic transfer
function can be regarded as a toy model of a multilayer perceptron, a parity machine.

In this context, Inoue, Nishimori and Kabashima [10] recently investigated the problem
of on-line learning of unlearnable rules where the teacher is a non-monotonic perceptron:
the output of the teacher is $T_a(v) = \mathrm{sign}[v(a - v)(a + v)]$, where $v$ is the input potential
of the teacher, $v = \sqrt{N}(\mathbf{J}^0\cdot\mathbf{x})$, with $\mathbf{x}$ being a training example, and the student is a simple
perceptron. For this system, the difficulty of learning for the student can be controlled by the
width $a$ of the reversed wedge. If $a = \infty$ or $a = 0$, the student can learn the rule perfectly
and the generalization error decays to zero as $\alpha^{-1/3}$ for the conventional perceptron learning
algorithm and $\alpha^{-1/2}$ for the Hebbian learning algorithm, where $\alpha$ is the number of presented
examples, $p$, divided by the number of input nodes, $N$. For finite $a$, the student cannot
generalize perfectly and the generalization error converges exponentially to a non-vanishing
$a$-dependent value.

In this paper we investigate the generalization ability of a student trained by the on-line
AdaTron learning algorithm with examples generated by the above-mentioned non-monotonic
rule. The AdaTron learning is a powerful method for learnable rules both in
on-line and off-line modes in the sense that this algorithm gives a fast decay, proportional
to $\alpha^{-1}$, of the generalization error [11-13], in contrast to the $\alpha^{-1/3}$ and $\alpha^{-1/2}$ decays of
the perceptron and Hebbian algorithms. We investigate the performance of the AdaTron
learning algorithm in the unlearnable situation and discuss the asymptotic behavior of the
generalization error.

This paper is organized as follows. In the next section, we explain the generic properties
of the generalization error for our system and formulate the on-line AdaTron learning. Some
of the results of our previous paper [10] concerning the perceptron and
Hebbian learning algorithms are collected there for comparison with the AdaTron learning. Section
III deals with the conventional AdaTron learning both for learnable and unlearnable rules.
In Sec. IV we investigate the effect of optimization of the learning rate. In Sec. V the issue
of optimization is treated from a different point of view where we do not use the parameter
$a$, which is unknown to the student, in the learning rate. In the last section we summarize our
results and discuss several future problems.



II. THE MODEL SYSTEM 

Let us first fix the notation. The input signal comes from $N$ input nodes and is represented
by an $N$-dimensional vector $\mathbf{x}$. The components of $\mathbf{x}$ are randomly drawn from a
uniform distribution and then $\mathbf{x}$ is normalized to unity. Synaptic connections from input
nodes to the student perceptron are also expressed by an $N$-dimensional vector $\mathbf{J}$, which is not
normalized. The teacher receives the same input signal $\mathbf{x}$ through the normalized synaptic
weight vector $\mathbf{J}^0$. The generalization error is $\epsilon_g = \langle\langle \Theta(-T_a(v)S(u)) \rangle\rangle$, where $S(u) = \mathrm{sign}(u)$
is the student output with the internal potential $u = \sqrt{N}(\mathbf{J}\cdot\mathbf{x})/|\mathbf{J}|$ and $\langle\langle \cdots \rangle\rangle$ stands for
the average over the distribution function

$$P_R(u,v) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\left[-\frac{u^2 + v^2 - 2Ruv}{2(1-R^2)}\right]. \tag{1}$$

Here $R$ stands for the overlap between the teacher and student weight vectors,
$R = (\mathbf{J}^0\cdot\mathbf{J})/|\mathbf{J}^0||\mathbf{J}|$. This distribution has been derived from the randomness of $\mathbf{x}$ and is valid in
the limit $N \to \infty$.
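As an illustrative check of this setup (a minimal sketch, not part of the original analysis), the pair $(u, v)$ can be sampled from the distribution (1) by writing $v = Ru + \sqrt{1-R^2}\,z$ with independent standard Gaussian variables $u$ and $z$, and the generalization error can be estimated as the fraction of samples on which $S(u)$ and $T_a(v)$ disagree. The function names, the sample size and the seed below are assumptions made for this example only.

```python
import math
import random


def teacher_output(v, a):
    """Reversed-wedge teacher T_a(v) = sign[v (a - v)(a + v)]."""
    return 1 if v * (a - v) * (a + v) > 0 else -1


def sample_potentials(R, rng):
    """Draw (u, v) from the bivariate Gaussian of Eq. (1) with overlap R."""
    u = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 1.0)  # component of v that is independent of u
    v = R * u + math.sqrt(max(1.0 - R * R, 0.0)) * z
    return u, v


def mc_generalization_error(R, a, n_samples=20000, seed=0):
    """Monte Carlo estimate of eps_g = <<Theta(-T_a(v) S(u))>>."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        u, v = sample_potentials(R, rng)
        student = 1 if u > 0 else -1
        if student != teacher_output(v, a):
            errors += 1
    return errors / n_samples


if __name__ == "__main__":
    for a in (float("inf"), 2.0, 0.5):
        print(a, mc_generalization_error(0.5, a))
```

For $a = \infty$ the teacher is an ordinary monotonic perceptron, so the estimate should approach the familiar value $\arccos(R)/\pi$; for $R = 1$ and finite $a$ the student errs exactly when $|v| > a$, which gives $2H(a)$.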



The generalization error $\epsilon_g$ is easily calculated as a function of $R$ as follows [10]:

$$\epsilon_g = E(R) \equiv 2\int_a^{\infty} Dv\, H\!\left(-\frac{Rv}{\sqrt{1-R^2}}\right) + 2\int_0^{a} Dv\, H\!\left(\frac{Rv}{\sqrt{1-R^2}}\right), \tag{2}$$

where $H(x) = \int_x^{\infty} Dt$ with $Dt \equiv dt\,\exp(-t^2/2)/\sqrt{2\pi}$. It is important that this expression is
independent of the specific learning algorithm. Minimization of $E(R)$ with respect to $R$ gives
the theoretical lower bound, or the best possible value, of the generalization error for given $a$.
In Fig. 1 we show $E(R)$ for several values of $a$. This figure indicates that the generalization
error goes to zero if the student is trained so that the overlap $R$ becomes 1 for $a = \infty$ and
$R = -1$ for $a = 0$. If the parameter $a$ is larger than some critical value $a_{c1} = \sqrt{2\log 2} = 1.177$,
$E(R)$ decreases monotonically from $1 - 2H(a)$ to $2H(a)$ as $R$ increases from $-1$ to 1. When $a$ is smaller
than $a_{c1}$, a local minimum appears at $R = R_* = -\sqrt{(2\log 2 - a^2)/2\log 2}$, but the global
minimum is still at $R = 1$ as long as $a$ is larger than $a_{c2} = 0.80$. If $a$ is less than $a_{c2}$, the
global minimum is found at $R = R_*$, not at $R = 1$. This situation is depicted in Figs.
2 and 3 where we show the optimal overlap $R$ giving the smallest value of $E(R)$ and the
corresponding best possible value of the generalization error as functions of $a$. From these
two figures, we see that the optimal overlap which gives the theoretical lower bound shows
a first-order phase transition at $a = a_{c2}$.

Therefore, our efforts should be directed to finding the best strategy which gives the best 
possible value of the generalization error for a wide range of the parameter a. 
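The landscape of $E(R)$ described above can be explored numerically. The following is a minimal sketch, assuming that $E(R)$ has the form of two Gaussian integrals over the teacher potential $v$, one for $v > a$ and one for $0 < v < a$, as follows from the definitions of $T_a(v)$ and $S(u)$. It uses a plain composite Simpson rule, and the cutoff `v_max`, the number of sub-intervals and all function names are choices made for this example rather than anything prescribed by the paper.

```python
import math


def H(x):
    """Gaussian tail H(x) = integral of Dt from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    if hi <= lo:
        return 0.0
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0


def gen_error(R, a, v_max=10.0):
    """E(R) for wedge width a; the Gaussian tail beyond v_max is neglected."""
    s = max(math.sqrt(max(1.0 - R * R, 0.0)), 1e-12)  # avoid division by zero

    def pdf(v):
        return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

    a_cut = min(a, v_max)
    outer = simpson(lambda v: pdf(v) * H(-R * v / s), a_cut, v_max)
    inner = simpson(lambda v: pdf(v) * H(R * v / s), 0.0, a_cut)
    return 2.0 * (outer + inner)


def local_minimum_overlap(a):
    """R_* = -sqrt((2 ln 2 - a^2) / (2 ln 2)), defined for a < a_c1."""
    two_log2 = 2.0 * math.log(2.0)
    return -math.sqrt((two_log2 - a * a) / two_log2)


if __name__ == "__main__":
    for a in (1.0, 0.5):
        r_star = local_minimum_overlap(a)
        print(a, r_star, gen_error(r_star, a), gen_error(1.0, a))
```

Comparing `gen_error(r_star, a)` with `gen_error(1.0, a)` for several values of $a$ is one way to locate numerically the point where the global minimum switches from $R = 1$ to $R = R_*$.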



It may be useful to review some of the results of Inoue, Nishimori and Kabashima [10],
who studied the present problem under the perceptron and Hebbian algorithms. For the
conventional perceptron learning, the generalization error decays to zero as $\alpha^{-1/3}$ if the rule
is learnable ($a = \infty$), whereas it converges exponentially to a non-vanishing value $E(R = 1 - 2\Delta)$, where
$\Delta = \exp(-a^2/2)$, for the unlearnable case. This value of $E(R)$ is larger than
the best possible value as seen in Fig. 3. Introduction of optimization processes of the
learning rate improves the performance significantly in the sense that the generalization
error then converges to the best possible value when $a > a_{c2}$. For the conventional Hebbian
learning, the generalization error decays to the theoretical lower bound as $\alpha^{-1/2}$ not only
in the learnable limit $a \to \infty$ but for a finite range of $a$, $a > a_{c1}$. However, for $a < a_{c1}$, the
generalization error does not converge to the optimal value.

III. LEARNING DYNAMICS 

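The on-line training dynamics of the AdaTron algorithm is

$$\mathbf{J}^{m+1} = \mathbf{J}^{m} - g(\alpha)\, u\, \Theta(-T_a(v)S(u))\, \mathbf{x}, \tag{3}$$

where $m$ stands for the number of presented patterns and $g(\alpha)$ is the learning rate. It is
straightforward to obtain the recursion equations for the overlap $R^m = (\mathbf{J}^m\cdot\mathbf{J}^0)/|\mathbf{J}^m||\mathbf{J}^0|$
and the length of the student weight vector $l^m = |\mathbf{J}^m|/\sqrt{N}$. In the limit $N \to \infty$, these two
dynamical quantities become self-averaging with respect to the random training data $\mathbf{x}$. For
continuous time $\alpha = m/N$ in the limit $N \to \infty$, $m \to \infty$ with $\alpha$ kept finite, the evolutions of
$R$ and $l$ are given by the following differential equations [10]:

$$\frac{dl}{d\alpha} = \frac{g^2 E_{\rm Ad}}{2l} - g E_{\rm Ad} \tag{4}$$

and

$$\frac{dR}{d\alpha} = -\frac{R g^2 E_{\rm Ad}}{2l^2} + \frac{g\,(E_{\rm Ad}R - G_{\rm Ad})}{l}, \tag{5}$$

where

$$E_{\rm Ad} \equiv \langle\langle u^2\, \Theta(-T_a(v)S(u)) \rangle\rangle = 2\int_0^{\infty} Du\, u^2\, H_a(u,R) \tag{6}$$

with

$$H_a(u,R) = H\!\left(\frac{a - Ru}{\sqrt{1-R^2}}\right) + H\!\left(\frac{Ru}{\sqrt{1-R^2}}\right) - H\!\left(\frac{a + Ru}{\sqrt{1-R^2}}\right) \tag{7}$$

and

$$G_{\rm Ad} \equiv \langle\langle u v\, \Theta(-T_a(v)S(u)) \rangle\rangle = \frac{(1-R^2)^{3/2}}{\pi}\left[2\exp\left(-\frac{a^2}{2(1-R^2)}\right) - 1\right] + \sqrt{\frac{2}{\pi}}\, R a (1-R^2)\Delta\left[1 - 2H\!\left(\frac{Ra}{\sqrt{1-R^2}}\right)\right] + R E_{\rm Ad}, \tag{8}$$

with $\Delta = \exp(-a^2/2)$ and $Du \equiv du\,\exp(-u^2/2)/\sqrt{2\pi}$.

Equations (4) and (5) determine the learning process. In the rest of the present section we
restrict ourselves to the case of $g = 1$ corresponding to the conventional AdaTron learning.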
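The flow in the $R$-$l$ plane can be traced numerically. The sketch below is one possible way to do so, assuming the forms of $E_{\rm Ad}$, $H_a$ and $G_{\rm Ad}$ obtained from the averages $\langle\langle u^2\Theta\rangle\rangle$ and $\langle\langle uv\Theta\rangle\rangle$ over the distribution (1). $E_{\rm Ad}$ is integrated with Simpson's rule and the differential equations are advanced with a simple Euler step. The cutoff `u_max`, the grid size, the step `d_alpha` and all function names are assumptions of this example.

```python
import math


def H(x):
    """Gaussian tail probability H(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def safe_s(R):
    """sqrt(1 - R^2), floored to avoid division by zero at |R| = 1."""
    return max(math.sqrt(max(1.0 - R * R, 0.0)), 1e-12)


def e_ad(R, a, u_max=10.0, n=4000):
    """E_Ad = 2 * integral_0^inf Du u^2 H_a(u, R), by Simpson's rule."""
    s = safe_s(R)

    def integrand(u):
        h_a = H((a - R * u) / s) + H(R * u / s) - H((a + R * u) / s)
        return u * u * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi) * h_a

    h = u_max / n
    total = integrand(0.0) + integrand(u_max)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return 2.0 * total * h / 3.0


def g_ad(R, a, e=None):
    """G_Ad expressed through E_Ad; pass a precomputed E_Ad as `e` to reuse it."""
    if e is None:
        e = e_ad(R, a)
    s = safe_s(R)
    if math.isinf(a):
        expo, wedge = 0.0, 0.0  # both a-dependent terms vanish for a = infinity
    else:
        expo = math.exp(-a * a / (2.0 * s * s))
        delta = math.exp(-0.5 * a * a)
        wedge = (math.sqrt(2.0 / math.pi) * R * a * s * s * delta
                 * (1.0 - 2.0 * H(R * a / s)))
    return s ** 3 / math.pi * (2.0 * expo - 1.0) + wedge + R * e


def flow(R, l, a, g=1.0):
    """Right-hand sides (dR/dalpha, dl/dalpha) of the flow equations."""
    e = e_ad(R, a)
    G = g_ad(R, a, e)
    dl = g * g * e / (2.0 * l) - g * e
    dR = -R * g * g * e / (2.0 * l * l) + g * (e * R - G) / l
    return dR, dl


def integrate(R0, l0, a, alpha_max, d_alpha=0.05, g=1.0):
    """Euler integration from alpha = 0 to alpha_max; returns the final (R, l)."""
    R, l = R0, l0
    for _ in range(int(round(alpha_max / d_alpha))):
        dR, dl = flow(R, l, a, g)
        R = min(R + d_alpha * dR, 1.0)
        l += d_alpha * dl
    return R, l


if __name__ == "__main__":
    print(integrate(0.0, 0.1, float("inf"), 5.0))
    print(integrate(0.0, 0.1, 2.0, 5.0))
```

An Euler step is the simplest choice and is adequate for a qualitative picture; a higher-order integrator would be preferable for extracting asymptotic prefactors.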



A. Learnable case 

We first consider the case of $g(\alpha) = 1$ and $a = \infty$, the learnable rule. We investigate the
asymptotic behavior of the generalization error when $R$ approaches 1, $R = 1 - \epsilon$ with $\epsilon \to 0$, and
$l \to l_0$, a constant. From Eqs. (6) and (8), we find $E_{\rm Ad} \simeq c\,\epsilon^{3/2}$ and $G_{\rm Ad} \simeq (c - 2\sqrt{2}/\pi)\,\epsilon^{3/2}$
with $c = 8/(3\sqrt{2}\pi)$. Then Eq. (5) is solved as $\epsilon = (2/k)^2 \alpha^{-2}$ with

$$k \equiv -\frac{c}{2l_0^2} + \frac{2\sqrt{2}}{\pi l_0}. \tag{9}$$

Using this equation and Eq. (2), we obtain the asymptotic form of the generalization error
as

$$\epsilon_g = E(R) \simeq \frac{\sqrt{2\epsilon}}{\pi} = \frac{2\sqrt{2}}{\pi k}\,\frac{1}{\alpha}. \tag{10}$$

The above expression of the generalization error depends on $l_0$, the asymptotic value of $l$,
through $k$. Apparently $l_0$ is a function of the initial value of $l$ as shown in Fig. 4. A special
case is $l_0 = 1/2$, in which case $l$ does not change as learning proceeds, as is apparent from
Eq. (4) as well as from Fig. 4. Such a constant-$l$ problem was studied by Biehl and Riegler
[11], who concluded

$$\epsilon_g = \frac{3}{2\alpha} \tag{11}$$

for the AdaTron algorithm. Our formula (10) reproduces this result when $l_0 = 1/2$. If one
takes $l_0$ as an adjustable parameter, it is possible to minimize $\epsilon_g$ by maximizing $k$ in the
denominator of Eq. (10). The smallest value of $\epsilon_g$ is achieved when $l_0 = \pi c/2\sqrt{2}$, yielding

$$\epsilon_g = \frac{4}{3\alpha}, \tag{12}$$

which is smaller than Eq. (11) for a fixed $l$. We therefore have found that the asymptotic
behavior of the generalization error depends upon whether or not the student weight vector
is normalized and that a better result is obtained for the un-normalized case. We plot the
generalization error for the present learnable case with the initial value of $l_{\rm init} = 0.1$ in Fig.
5. We see that the Hebbian learning has the highest generalization ability and the AdaTron
learning shows the slowest decay among the three algorithms in the initial stage of learning.
However, as the number of presented patterns increases, the AdaTron algorithm eventually
achieves the smallest value of the generalization error. In this sense the AdaTron learning
algorithm is the most efficient learning strategy among the three in the case of the learnable
rule.
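The dependence of the asymptotic prefactor on $l_0$ can be checked with a few lines of arithmetic. The following sketch, with function and constant names chosen for this example, evaluates $k(l_0) = -c/(2l_0^2) + 2\sqrt{2}/(\pi l_0)$ and the coefficient of $1/\alpha$ in $\epsilon_g \simeq 2\sqrt{2}/(\pi k \alpha)$, using $c = 8/(3\sqrt{2}\pi)$.

```python
import math

# Coefficient c in E_Ad ~ c * eps^(3/2) for the learnable case
C = 8.0 / (3.0 * math.sqrt(2.0) * math.pi)

# Asymptotic length that maximizes k: l_0 = pi c / (2 sqrt 2)
L0_OPT = math.pi * C / (2.0 * math.sqrt(2.0))


def k_of_l0(l0):
    """k = -c / (2 l0^2) + 2 sqrt(2) / (pi l0)."""
    return -C / (2.0 * l0 * l0) + 2.0 * math.sqrt(2.0) / (math.pi * l0)


def error_prefactor(l0):
    """Coefficient A in eps_g ~ A / alpha; only meaningful when k(l0) > 0."""
    k = k_of_l0(l0)
    if k <= 0.0:
        raise ValueError("k(l0) must be positive for the 1/alpha decay")
    return 2.0 * math.sqrt(2.0) / (math.pi * k)


if __name__ == "__main__":
    for l0 in (0.5, L0_OPT, 1.0):
        print(l0, error_prefactor(l0))
```

With these definitions $l_0 = 1/2$ gives the prefactor $3/2$, and the optimal length evaluates to $l_0 = 2/3$ with prefactor $4/3$. The expression for $k$ is positive only for $l_0 > 1/3$, which is why the sketch raises an error otherwise.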



B. Unlearnable case 



For the unlearnable case, there can exist only one fixed point $l_0 = 1/2$. The reason is that, for
finite $a$, $E_{\rm Ad}$ appearing in Eq. (4) does not vanish in the limit of large $\alpha$ but takes a
finite value for $\alpha \to \infty$. For this finite $E_{\rm Ad}$, the differential equation (4) has only one fixed
point $l_0 = 1/2$. In contrast, for the learnable case, $E_{\rm Ad}$ behaves as $E_{\rm Ad} \simeq c\,\epsilon^{3/2}$ in the limit
of $\alpha \to \infty$ and thus $dl/d\alpha$ becomes zero asymptotically, irrespective of $l$. We plot trajectories
in the $R$-$l$ plane for $a = 2$ in Fig. 6 and the corresponding generalization error is plotted
in Fig. 7 as an example. From Fig. 6, we see that the destination of $l$ is 1/2 for all initial
conditions. Figure 7 tells us that for the unlearnable case $a = 2$, the AdaTron learning has
the lowest generalization ability among the three. We should notice that the generalization
error decays to its asymptotic value, the residual error $\epsilon_{\min}$, as $\epsilon_g - \epsilon_{\min} \sim \alpha^{-1/2}$ for the
Hebbian learning and decays exponentially for perceptron learning [10]. The residual error
of the Hebbian learning, $\epsilon_{\min} = 2H(a)$, is also the best possible value of the generalization
error for $a > a_{c2}$ as seen in Fig. 3. In Fig. 8 we also plot the generalization error of the
AdaTron algorithm for several values of $a$. For the AdaTron learning of the unlearnable
case, the generalization error converges exponentially to a non-optimal value $E(R_0)$.

For all unlearnable cases, the $R$-$l$ flow is attracted into the fixed point $(R_0, 1/2)$, where
$R_0$ is obtained from

$$\frac{dR}{d\alpha} = -2G_{\rm Ad}(R_0) = 0. \tag{13}$$

The solution $R_0$ of the above equation is not the optimal value because the optimal value of
the present learning system is $R_{\rm opt} = 1$ for $a > a_{c2}$ and $R_{\rm opt} = R_* = -\sqrt{(2\log 2 - a^2)/2\log 2}$
for $a < a_{c2}$ [10].

From Figs. 3 and 7, we see that the residual error $\epsilon_{\min}$ of the AdaTron learning is larger
than that of the conventional perceptron learning. Therefore, we conclude that if the student
learns from unlearnable rules, the on-line AdaTron algorithm becomes the worst strategy
among the three learning algorithms discussed above, although for the learnable case the
on-line AdaTron learning is a sophisticated algorithm and the generalization error decays to
zero as quickly as in off-line learning [14].



IV. OPTIMIZATION 

In the previous section, we saw that the on-line AdaTron learning fails to reach the best
possible value of the generalization error for the unlearnable case and that its residual error $\epsilon_{\min}$
is larger than that of the conventional perceptron learning or Hebbian learning. We show
that it is possible to overcome this difficulty.

We now consider an optimization of the learning rate $g(\alpha)$ [10]. This optimization procedure
is different from the technique of Kinouchi and Caticha [15]. As the optimal value of $R$
which gives the best possible value of the generalization error is $R_{\rm opt} = 1$ for $a > a_{c2}$, we
determine $g(\alpha)$ so that $R$ is accelerated to become 1. In order to determine $g$ using the
above strategy we maximize the right hand side of Eq. (5) with respect to $g(\alpha)$ and obtain
$g_{\rm opt} = (E_{\rm Ad}R - G_{\rm Ad})\,l/(R E_{\rm Ad})$. Using this optimal learning rate, Eqs. (4) and (5) are rewritten
as follows:

$$\frac{dl}{d\alpha} = -\frac{(E_{\rm Ad}R - G_{\rm Ad})(E_{\rm Ad}R + G_{\rm Ad})}{2R^2 E_{\rm Ad}}\, l, \tag{14}$$

$$\frac{dR}{d\alpha} = \frac{(E_{\rm Ad}R - G_{\rm Ad})^2}{2R E_{\rm Ad}}. \tag{15}$$

For the learnable case, we obtain the asymptotic form of the generalization error from
Eqs. (14) and (15) by the same relation $R = 1 - \epsilon$, $\epsilon \to 0$, as we used for the case of $g = 1$ as

$$\epsilon_g = \frac{4}{3\alpha}. \tag{16}$$

This is the same asymptotic behavior as that obtained by optimizing the initial value of $l$
as we saw in the previous section.
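For reference, the optimization over $g$ used in this section can be written out explicitly. The steps below are a sketch of the intermediate algebra, assuming the flow equations $dl/d\alpha = g^2E_{\rm Ad}/(2l) - gE_{\rm Ad}$ and $dR/d\alpha = -Rg^2E_{\rm Ad}/(2l^2) + g(E_{\rm Ad}R - G_{\rm Ad})/l$. The right-hand side of the equation for $R$ is a quadratic in $g$, concave whenever $RE_{\rm Ad} > 0$, so its stationary point is a maximum.

```latex
% dR/d(alpha) as a quadratic function of g
\frac{dR}{d\alpha}
  = -\frac{R E_{\rm Ad}}{2 l^2}\, g^2 + \frac{E_{\rm Ad}R - G_{\rm Ad}}{l}\, g ,
\qquad
\frac{\partial}{\partial g}\frac{dR}{d\alpha}
  = -\frac{R E_{\rm Ad}}{l^2}\, g + \frac{E_{\rm Ad}R - G_{\rm Ad}}{l} = 0
\;\Longrightarrow\;
g_{\rm opt} = \frac{(E_{\rm Ad}R - G_{\rm Ad})\, l}{R E_{\rm Ad}} .

% Substituting g_opt into the equation for R
\left.\frac{dR}{d\alpha}\right|_{g_{\rm opt}}
  = -\frac{(E_{\rm Ad}R - G_{\rm Ad})^2}{2 R E_{\rm Ad}}
    + \frac{(E_{\rm Ad}R - G_{\rm Ad})^2}{R E_{\rm Ad}}
  = \frac{(E_{\rm Ad}R - G_{\rm Ad})^2}{2 R E_{\rm Ad}} .

% Substituting g_opt into the equation for l
\left.\frac{dl}{d\alpha}\right|_{g_{\rm opt}}
  = g_{\rm opt} E_{\rm Ad}\left(\frac{g_{\rm opt}}{2l} - 1\right)
  = -\frac{(E_{\rm Ad}R - G_{\rm Ad})(E_{\rm Ad}R + G_{\rm Ad})}{2 R^2 E_{\rm Ad}}\, l .
```

Because the optimized $dR/d\alpha$ is a perfect square divided by $2RE_{\rm Ad}$, the overlap increases monotonically for $R > 0$, whatever the sign of $E_{\rm Ad}R - G_{\rm Ad}$.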

Next we investigate the unlearnable case. The asymptotic forms of $E_{\rm Ad}$ and $E_{\rm Ad}R - G_{\rm Ad}$
in the limit of $\alpha \to \infty$ are obtained as

$$E_{\rm Ad} \simeq 2H(a) + \sqrt{\frac{2}{\pi}}\, a\Delta \tag{17}$$

and

$$E_{\rm Ad}R - G_{\rm Ad} \simeq -\frac{4a\Delta\epsilon}{\sqrt{2\pi}}. \tag{18}$$

Then we get the asymptotic solution of Eq. (15) with respect to $\epsilon$, $R = 1 - \epsilon$, as

$$\epsilon = \frac{2\pi H(a) + \sqrt{2\pi}\, a\Delta}{4a^2\Delta^2}\,\frac{1}{\alpha}. \tag{19}$$

As the asymptotic behavior of $E(R)$ is obtained as $E(R) = \epsilon_g = 2H(a) + \sqrt{2\epsilon}/\pi$ [10], we
find the generalization error in the limit of $\alpha \to \infty$ as follows:

$$\epsilon_g = 2H(a) + \frac{\sqrt{2\left(2\pi H(a) + \sqrt{2\pi}\, a\Delta\right)}}{2\pi a\Delta}\,\frac{1}{\sqrt{\alpha}}, \tag{20}$$

where $2H(a)$ is the best possible value of the generalization error for $a > a_{c2}$. Therefore, our
strategy to optimize the learning rate succeeds in training the student to obtain the optimal
overlap $R = 1$ for $a > a_{c2}$.

For the perceptron learning, this type of optimization failed to reach the theoretical
lower bound of the generalization error exactly at $a = a_{c1} = \sqrt{2\log 2}$, in which case the
generalization error is $\epsilon_g = 1/2$, equivalent to a random guess, because the optimal
learning rate vanishes for $a = a_{c1}$ [10]. In contrast, for the AdaTron learning, the optimal learning rate
has a non-zero value even at $a = a_{c1}$. In this sense, the on-line AdaTron learning with
optimal learning rate is superior to the perceptron learning.



V. PARAMETER-FREE OPTIMIZATION 

In the previous section, we were able to reach the theoretical lower bound of the generalization
error for $a > a_{c2}$ by introducing the optimal learning rate $g_{\rm opt}$. However, as the optimal
learning rate $g_{\rm opt}$ contains the parameter $a$, which is unknown to the student, the above result can be
regarded only as a lower bound of the generalization error. The reason is that the student can
get information only about the teacher's output and has no knowledge of $a$ or $v = \sqrt{N}(\mathbf{J}^0\cdot\mathbf{x})/|\mathbf{J}^0|$.
In realistic situations, the student does not know $a$ or $v$ and therefore has a larger value
of the generalization error. In this section, we construct a learning algorithm without the
unknown parameter $a$ using the asymptotic form of the optimal learning rate.



A. Learnable case

For the learnable case, the optimal learning rate is estimated in the limit of $\alpha \to \infty$ as

$$g_{\rm opt} = \frac{(E_{\rm Ad}R - G_{\rm Ad})\, l}{R E_{\rm Ad}} \simeq \frac{3}{2}\, l. \tag{21}$$

This asymptotic form of the optimal learning rate depends on $\alpha$ only through the length $l$
of the student's weight vector. We therefore adopt $g(\alpha)$ proportional to $l$, $g(\alpha) = \eta l$, also in
the case of the parameter-free optimization and adjust the parameter $\eta$ so that the student
obtains the best generalization ability. Substituting this expression into the differential
equation (5) for $R$ and using $R = 1 - \epsilon$ with $\epsilon \to 0$, we get

$$\frac{d\epsilon}{d\alpha} = -F(\eta)\,\epsilon^{3/2}, \tag{22}$$

where we have set

$$F(\eta) \equiv \frac{2\sqrt{2}}{\pi}\,\eta - \frac{2\sqrt{2}}{3\pi}\,\eta^2. \tag{23}$$

This leads to $\epsilon = (F(\eta)/2)^{-2}\alpha^{-2}$. Then, the generalization error is obtained from $\epsilon_g =
\sqrt{2\epsilon}/\pi$ as

$$\epsilon_g = \frac{2\sqrt{2}}{\pi F(\eta)\,\alpha}. \tag{24}$$

In order to minimize $\epsilon_g$, we maximize $F(\eta)$ with respect to $\eta$. The optimal choice of $\eta$ in
this sense is $\eta_{\rm opt} = 3/2$ and we find in such a case

$$\epsilon_g = \frac{4}{3\alpha}. \tag{25}$$

This is the same asymptotic form as the previous $a$-dependent result (16).
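The choice of $\eta$ can be illustrated numerically. The sketch below assumes $F(\eta) = (2\sqrt{2}/\pi)\eta - (2\sqrt{2}/3\pi)\eta^2$, which follows from inserting $g = \eta l$ together with $E_{\rm Ad} \simeq c\,\epsilon^{3/2}$ and $E_{\rm Ad}R - G_{\rm Ad} \simeq (2\sqrt{2}/\pi)\epsilon^{3/2}$ into the equation for $R$. The grid scan mimics the trial-and-error choice of $\eta$; the function names and the grid are assumptions of this example.

```python
import math

SQRT2 = math.sqrt(2.0)


def F(eta):
    """F(eta) = (2 sqrt2 / pi) eta - (2 sqrt2 / (3 pi)) eta^2."""
    return 2.0 * SQRT2 / math.pi * eta - 2.0 * SQRT2 / (3.0 * math.pi) * eta ** 2


def error_prefactor(eta):
    """Coefficient A in eps_g ~ A / alpha for g = eta * l; needs F(eta) > 0."""
    f = F(eta)
    if f <= 0.0:
        raise ValueError("F(eta) must be positive, i.e. 0 < eta < 3")
    return 2.0 * SQRT2 / (math.pi * f)


def best_eta_on_grid(step=0.01):
    """Scan 0 < eta < 3 and return the grid point with the smallest prefactor."""
    n = int(round(3.0 / step))
    grid = [i * step for i in range(1, n)]
    return min(grid, key=error_prefactor)


if __name__ == "__main__":
    eta_best = best_eta_on_grid()
    print(eta_best, error_prefactor(eta_best))
```

The scan should settle at $\eta = 3/2$ with prefactor $4/3$, in line with the analytic maximization of $F(\eta)$.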
B. Unlearnable case

Next we consider the unlearnable case. The asymptotic form of the learning rate we
derived in the previous section for the unlearnable case is

$$g_{\rm opt} = \frac{(E_{\rm Ad}R - G_{\rm Ad})\, l}{R E_{\rm Ad}} \simeq -\frac{4a\Delta\epsilon/\sqrt{2\pi}}{2H(a) + \sqrt{2/\pi}\, a\Delta}\, l = -\eta\,\frac{l}{\alpha}, \tag{26}$$

where we used Eq. (19) to obtain the right-most equality and we set the $a$-dependent
prefactor of $l/\alpha$ as $\eta$. Using this learning rate (26) and the asymptotic forms of $E_{\rm Ad}(R =
1 - \epsilon, \epsilon \to 0)$ and $G_{\rm Ad}(R = 1 - \epsilon, \epsilon \to 0)$, namely $E_{\rm Ad} \simeq 2H(a) + \sqrt{2/\pi}\, a\Delta$ and $G_{\rm Ad} \simeq 4a\Delta\epsilon/\sqrt{2\pi} + R E_{\rm Ad}$,
in the limit of $\alpha \to \infty$, we obtain the differential equation with respect to $\epsilon$ from Eq. (5) as
follows:

$$\frac{d\epsilon}{d\alpha} = \frac{1}{2}\left[2H(a) + \sqrt{\frac{2}{\pi}}\, a\Delta\right]\frac{\eta^2}{\alpha^2} - \frac{4a\Delta\eta}{\sqrt{2\pi}}\,\frac{\epsilon}{\alpha}. \tag{27}$$

This differential equation can be solved analytically as

$$\epsilon = \frac{\eta^2\left(2H(a) + \sqrt{2/\pi}\, a\Delta\right)}{2\left(4a\Delta\eta/\sqrt{2\pi} - 1\right)}\,\frac{1}{\alpha} + A\left(\frac{1}{\alpha}\right)^{4a\Delta\eta/\sqrt{2\pi}}, \tag{28}$$

where $A$ is a constant determined by the initial condition. Therefore, if we choose $\eta$ to
satisfy $4a\Delta\eta/\sqrt{2\pi} - 1 > 0$, the generalization error converges to the optimal value $2H(a)$ as

$$\epsilon_g = 2H(a) + \frac{\sqrt{2\epsilon}}{\pi} = 2H(a) + \frac{\eta}{\pi}\sqrt{\frac{2H(a) + \sqrt{2/\pi}\, a\Delta}{4a\Delta\eta/\sqrt{2\pi} - 1}}\;\frac{1}{\sqrt{\alpha}}. \tag{29}$$

In order to obtain the best generalization ability, we minimize the prefactor of $1/\sqrt{\alpha}$ in the
second term of Eq. (29) and obtain

$$\eta = \frac{\sqrt{2\pi}}{2a\Delta}. \tag{30}$$

For this $\eta$, the condition $4a\Delta\eta/\sqrt{2\pi} - 1 > 0$ is satisfied. In general, if we take $\eta$ independent
of $a$, the condition $4a\Delta\eta/\sqrt{2\pi} - 1 > 0$ is not always satisfied. The quantity $b = 4a\Delta/\sqrt{2\pi}$
takes the maximum value $4/\sqrt{2\pi e}$ at $a = 1$. Therefore, whatever value of $a$ we choose, we
cannot obtain the $\alpha^{-1/2}$ convergence if the product of this maximum value $4/\sqrt{2\pi e}$ and $\eta$
is not larger than unity. This means that $\eta$ should satisfy $\eta > \sqrt{2\pi e}/4 \simeq 1.033$ for the first
term of Eq. (28) to dominate asymptotically, yielding Eq. (29), for a non-vanishing range of
$a$. In contrast, if we choose $\eta$ to satisfy $b\eta - 1 < 0$, the generalization error is dominated by
the second term of Eq. (28) and behaves as

$$\epsilon_g = 2H(a) + \frac{\sqrt{2A}}{\pi}\left(\frac{1}{\alpha}\right)^{2a\Delta\eta/\sqrt{2\pi}}. \tag{31}$$

In this case, the generalization error converges less quickly than (29). For example, if we
choose $\eta = 1$, we find that the condition $b\eta > 1$ cannot be satisfied by any $a$ and the
generalization error converges as in Eq. (31). If we set $\eta = 2$ ($> \sqrt{2\pi e}/4 = 1.033$) as
another example, the asymptotic form of the generalization error is either Eq. (29) or Eq.
(31) depending on the value of $a$.
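The two convergence regimes can be tabulated with a short script. This is a minimal sketch, assuming $b = 4a\Delta/\sqrt{2\pi}$ with $\Delta = \exp(-a^2/2)$ and reading the decay exponent of $\epsilon_g - 2H(a)$ off the two asymptotic forms above: $1/2$ when $b\eta > 1$ and $b\eta/2$ when $b\eta < 1$. The function names are choices made for this example, and the marginal case $b\eta = 1$ is not treated in the text; the sketch simply returns $1/2$ there.

```python
import math

# Smallest eta for which b * eta > 1 is possible for some a
ETA_MIN = math.sqrt(2.0 * math.pi * math.e) / 4.0


def b_coeff(a):
    """b = 4 a Delta / sqrt(2 pi) with Delta = exp(-a^2 / 2)."""
    return 4.0 * a * math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def eta_opt(a):
    """a-dependent optimum eta = sqrt(2 pi) / (2 a Delta), i.e. b * eta = 2."""
    return math.sqrt(2.0 * math.pi) / (2.0 * a * math.exp(-0.5 * a * a))


def decay_exponent(a, eta):
    """Exponent x in eps_g - 2H(a) ~ alpha^(-x) for the rate g = -eta * l / alpha."""
    b_eta = b_coeff(a) * eta
    if b_eta > 1.0:
        return 0.5          # first term of the solution dominates
    return 0.5 * b_eta      # slower power law; b_eta == 1 is the untreated marginal case


if __name__ == "__main__":
    for a in (0.9, 1.0, 1.5, 2.0, 3.0):
        print(a, b_coeff(a), decay_exponent(a, 1.0), decay_exponent(a, 2.0))
```

Scanning $a$ at fixed $\eta$ in this way shows directly the window of wedge widths over which a parameter-free choice of $\eta$ still yields the $\alpha^{-1/2}$ decay.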

VI. CONCLUSION 

We have investigated the generalization abilities of a simple perceptron trained by the
teacher who is also a simple perceptron but has a non-monotonic transfer function using
the on-line AdaTron algorithm. For the learnable case ($a = \infty$), if we fix the length of the
student weight vector as $l = |\mathbf{J}|/\sqrt{N} = 1/2$, the generalization error converges to zero as
$\simeq 3/(2\alpha)$ as Biehl and Riegler reported [11]. However, if we allow the time development
of the length of the student weight vector, the asymptotic behavior of the generalization error
shows dependence on the initial value of $l$. When the student starts the training process
from the optimal length of weight vector $l$, we can obtain the generalization error $\epsilon_g \simeq 4/(3\alpha)$,
which is slightly smaller than $3/(2\alpha)$. As the student is able to know the length of its own
weight vector in principle, we can get the better generalization ability $\epsilon_g \simeq 4/(3\alpha)$ by a
heuristic search of the optimal initial value of $l$. On the other hand, if the width $a$ of
the reversed wedge has a finite value, the generalization error converges exponentially to a
non-optimal $a$-dependent value. In addition, these residual errors are larger than those of
the conventional perceptron learning for the whole range of $a$. Therefore we conclude that,
although the AdaTron learning is powerful for the learnable case [11], including the situation
in which the input vector is structured [13], it is not necessarily suitable for learning of the
non-monotonic input-output relations.

Next we introduced the learning rate and optimized it. For the learnable case, the
generalization error converges to zero as $\simeq 4/(3\alpha)$, which is as fast as the result obtained
by selecting the optimal initial condition for the case of non-optimization, $g = 1$. For this
learnable case, the asymptotic form of the optimal learning rate is $g_{\rm opt} \simeq 3l/2$. Therefore, for
the on-line AdaTron learning, it seems that the length of the student weight vector plays
an important role in obtaining a better generalization ability. If the task is unlearnable, the
generalization error under the optimized learning rate converges to the theoretical lower bound
$2H(a)$ as $\alpha^{-1/2}$ for $a > a_{c2}$. Using this strategy, we can get the optimal residual error
even exactly at $a = a_{c1}$, for which the optimized perceptron learning failed to obtain the optimal
residual error [10].

We also investigated the generalization ability using a parameter-free learning rate.
When the task is learnable, we assumed $g = \eta l$ and optimized the prefactor $\eta$. As a result,
we obtained $\epsilon_g \simeq 4/(3\alpha)$, which is the same asymptotic form as the parameter-dependent
case. Therefore, we can obtain this generalization ability by a heuristic choice of $\eta$; we may
choose the best $\eta$ by trial and error. On the other hand, for the unlearnable case, we used
the asymptotic form of the $a$-dependent learning rate in the limit of $\alpha \to \infty$, $g_{\rm opt} \simeq -\eta l/\alpha$, and
optimized the coefficient $\eta$. The generalization error then converges to $2H(a)$ as $\alpha^{-1/2}$ for
$b\eta > 1$. If $b\eta < 1$, the generalization error decays to $2H(a)$ as $\alpha^{-b\eta/2}$, where the exponent
$b\eta/2$ is smaller than 1/2 because $b\eta < 1$. Similar slowing down of the convergence rate of
the generalization error by tuning a control parameter was also reported by Kabashima and
Shinomoto in the problem of learning of two-dimensional blurred dichotomy [16].

In conclusion, we could overcome the difficulty of the AdaTron learning of unlearnable
problems by optimizing the learning rate, and the generalization error was shown to converge
to the best possible value as long as the width $a$ of the reversed wedge satisfies $a > a_{c2}$. For the
parameter region $a < a_{c2}$, this approach does not work well because the optimal value of $R$
is $R_*$ instead of 1; our optimization is designed to accelerate the increase of $R$ toward 1.

In this paper, we could construct a learning strategy suitable to achieve the $a$-dependent
optimal value $2H(a)$ for $a > a_{c2}$. However, for $a < a_{c2}$, it is a very difficult but challenging
future problem to get the optimal value by improving the conventional AdaTron learning.
FIGURES

FIG. 1. Generalization error as a function of $R$ for $a = \infty$, 2, 1, 0.5 and 0.

FIG. 2. Optimal overlap $R$ which gives the best possible value and overlaps which give the
residual error for Hebbian, perceptron and AdaTron learning algorithms.

FIG. 3. Best possible value of the generalization error and the residual generalization errors of
conventional Hebbian, perceptron and AdaTron learning algorithms, plotted as functions of $a$.
Except for $a = \infty$ and $a = 0$, the AdaTron learning cannot lead the student to the best possible
value of the generalization error. In addition, for a finite value of $a$, the residual generalization
error of the AdaTron learning is larger than that of the perceptron learning.

FIG. 4. $R$-$l$ trajectories of the AdaTron learning for the learnable case $a = \infty$. The fixed point
depends on the initial value of $l = l_{\rm init}$. For the special case of $l_{\rm init} = 0.5$, the flow of $l$ becomes
independent of $\alpha$.

FIG. 5. Generalization errors of the AdaTron, perceptron and Hebbian learning algorithms
for the learnable case $a = \infty$. The initial value of $l$ is $l_{\rm init} = 0.1$ for all algorithms. The AdaTron
learning shows the fastest convergence among the three.

FIG. 6. $R$-$l$ trajectories of the AdaTron learning for the unlearnable case $a = 2$. All flows of $l$
converge to the fixed point at $l_0 = 1/2$.

FIG. 7. Generalization errors of the AdaTron, perceptron and Hebbian learning algorithms
for the unlearnable case $a = 2$. The AdaTron learning shows the largest residual error among the
three.

FIG. 8. Generalization errors of the AdaTron learning algorithm for the cases of $a = \infty$, 2, 1
and 0.5.